Goto

Collaborating Authors

 information profile


Error Bounds and Optimal Schedules for Masked Diffusions with Factorized Approximations

arXiv.org Machine Learning

Recently proposed generative models for discrete data, such as Masked Diffusion Models (MDMs), exploit conditional independence approximations to reduce the computational cost of popular Auto-Regressive Models (ARMs), at the price of some bias in the sampling distribution. We study the resulting computation-vs-accuracy trade-off, providing general error bounds (in relative entropy) that depend only on the average number of tokens generated per iteration and are independent of the data dimensionality (i.e. sequence length), thus supporting the empirical success of MDMs. We then investigate the gain obtained by using non-constant schedule sizes (i.e. varying the number of unmasked tokens during the generation process) and identify the optimal schedule as a function of a so-called information profile of the data distribution, thus allowing for a principled optimization of schedule sizes. We define methods directly as sampling algorithms and do not use classical derivations as time-reversed diffusion processes, leading us to simple and transparent proofs.


0c74b7f78409a4022a2c4c5a5ca3ee19-Paper.pdf

Neural Information Processing Systems

Languages vary widely in many ways, including their canonical word order. A basic aspect of the observed variation is the fact that some word orders are much more common than others. Although this regularity has been recognized for some time, it has not been well-explained. In this paper we offer an informationtheoretic explanation for the observed word-order distribution across languages, based on the concept of Uniform Information Density (UID). We suggest that object-first languages are particularly disfavored because they are highly nonoptimal if the goal is to distribute information content approximately evenly throughout a sentence, and that the rest of the observed word-order distribution is at least partially explainable in terms of UID. We support our theoretical analysis with data from child-directed speech and experimental work.


Unsupervised pre-training helps to conserve views from input distribution

arXiv.org Machine Learning

We investigate the effects of the unsupervised pre-training method under the perspective of information theory. If the input distribution displays multiple views of the supervision, then unsupervised pre-training allows to learn hierarchical representation which communicates these views across layers, while disentangling the supervision. Disentanglement of supervision leads learned features to be independent conditionally to the label. In case of binary features, we show that conditional independence allows to extract label's information with a linear model and therefore helps to solve under-fitting. We suppose that representations displaying multiple views help to solve over-fitting because each view provides information that helps to reduce model's variance. We propose a practical method to measure both disentanglement of supervision and quantity of views within a binary representation. We show that unsupervised pre-training helps to conserve views from input distribution, whereas representations learned using supervised models disregard most of them.


Why are some word orders more common than others? A uniform information density account

Neural Information Processing Systems

Languages vary widely in many ways, including their canonical word order. A basic aspect of the observed variation is the fact that some word orders are much more common than others. Although this regularity has been recognized for some time, it has not been well-explained. In this paper we offer an information-theoretic explanation for the observed word-order distribution across languages, based on the concept of Uniform Information Density (UID). We suggest that object-first languages are particularly disfavored because they are highly non-optimal if the goal is to distribute information content approximately evenly throughout a sentence, and that the rest of the observed word-order distribution is at least partially explainable in terms of UID. We support our theoretical analysis with data from child-directed speech and experimental work.